Artificial Intelligence and Machine Learning

Personal Loan Campaign Project

Ravi Potlachervu

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan or not, identify which variables are most significant, and determine which segment of customers should be targeted more.

Data Dictionary

Data Sanity Checks

Importing required libraries

Import the dataset

Understand the Dataset

Let's start by performing basic steps to understand the data such as:

Review first and last few rows of the dataset

The dataset loaded without any issues

Total number of rows and columns

Get the datatype information of the columns

Checking for missing values in the data

Checking for duplicate values in the dataset

Getting the statistical summary for the dataset

Get the count of unique values in each column
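The steps above can be sketched with pandas as follows; the small DataFrame here is a hypothetical stand-in for the actual campaign file, which would be loaded with `pd.read_csv(...)`:

```python
import pandas as pd

# Hypothetical stand-in for the loaded campaign data;
# replace with pd.read_csv(...) on the actual file.
data = pd.DataFrame({
    "Age": [25, 45, 39, 35, 35],
    "Income": [49, 34, 11, 100, 45],
    "Personal_Loan": [0, 0, 0, 0, 0],
})

print(data.head())              # first few rows
print(data.tail())              # last few rows
print(data.shape)               # (rows, columns)
print(data.dtypes)              # datatype of each column
print(data.isnull().sum())      # missing values per column
print(data.duplicated().sum())  # number of duplicate rows
print(data.describe().T)        # statistical summary
print(data.nunique())           # unique values per column
```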

Observations

Univariate Analysis

Let's check the data distribution of the numerical columns
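A minimal plotting sketch for this step, assuming matplotlib is available; the synthetic `Age` and `Income` columns are hypothetical stand-ins for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical numeric columns standing in for the real Age and Income data
data = pd.DataFrame({
    "Age": rng.integers(23, 67, 500).astype(float),
    "Income": rng.gamma(2.0, 35.0, 500),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for i, col in enumerate(["Age", "Income"]):
    axes[0, i].hist(data[col], bins=30)        # overall distribution shape
    axes[0, i].set_title(f"{col} histogram")
    axes[1, i].boxplot(data[col], vert=False)  # spread and outliers
    axes[1, i].set_title(f"{col} boxplot")
fig.tight_layout()

skewness = data.skew()  # a large positive value means a right-tailed column
```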

Observations

Age

Experience

Income

CCAvg

Mortgage

Observations on Categorical Attributes

Observations on categorical columns

Family

Education

Income

Personal_Loan

Security_Account

CD_Account

Online

CreditCard

ZIPCode

Bivariate Analysis

Personal_Loan vs other variables

Let's check the variation in Personal_Loan with some of the categorical columns in our data
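One way to sketch this comparison is a row-normalised crosstab of each categorical column against `Personal_Loan`; the tiny sample below is hypothetical and only illustrates the shape of the check:

```python
import pandas as pd

# Hypothetical sample relating Education level to loan uptake
df = pd.DataFrame({
    "Education": [1, 1, 2, 2, 3, 3, 3, 1],
    "Personal_Loan": [0, 0, 0, 1, 1, 1, 0, 0],
})

# Row-normalised crosstab: the share of loan takers within each category
ct = pd.crosstab(df["Education"], df["Personal_Loan"], normalize="index")
print(ct)
# ct.plot(kind="bar", stacked=True) would visualize the same comparison
```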

Overall EDA observations

Data Preparation

Outlier Detection and Treatment

Creating training and test sets.
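A short sketch of the split with scikit-learn, using synthetic arrays in place of the prepared features and target; stratifying on the target keeps the rare positive (loan) class at the same share in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # hypothetical feature matrix
y = (X[:, 0] > 0.5).astype(int)  # hypothetical Personal_Loan target

# Stratify on y so the minority class keeps the same share in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
```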

Building the model

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
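One possible shape for such helpers, sketched with scikit-learn's metric functions; the `DummyClassifier` at the end is only a stand-in used to exercise the helpers, not the project's model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(model, X, y):
    """Return the common classification metrics for a fitted model."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred, zero_division=0),
        "Precision": precision_score(y, pred, zero_division=0),
        "F1": f1_score(y, pred, zero_division=0),
    }

def conf_matrix(model, X, y):
    """Return the confusion matrix as [[TN, FP], [FN, TP]]."""
    return confusion_matrix(y, model.predict(X))

# Quick self-check with a trivial majority-class model
X = np.zeros((10, 2))
y = np.array([0] * 7 + [1] * 3)
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
perf = model_performance(clf, X, y)
```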

Logistic Regression
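A minimal fitting sketch with scikit-learn's `LogisticRegression`, using synthetic data in place of the prepared training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # hypothetical predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # linearly separable target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

log_reg = LogisticRegression().fit(X_train, y_train)
train_acc = log_reg.score(X_train, y_train)
test_acc = log_reg.score(X_test, y_test)
```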

Checking model performance on training set

Checking performance on test set

ROC-AUC

1. predict_proba

Predicts the probabilities for classes 0 and 1.

Input: Train or test data

Output: Returns the predicted probabilities for class 0 and 1

2. roc_auc_score

Returns the AUC score

Input:

     1. True labels
     2. Predicted probabilities for class 1

Output: AUC score between 0 and 1

3. roc_curve

Takes the true labels and the predicted probabilities for class 1, and returns the false positive rates, true positive rates, and threshold values.

Input:

    1. True labels
    2. Predicted probabilities for class 1

Output: False positive rates, true positive rates, and threshold values
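The three calls above can be combined in a short sketch; the synthetic data and model stand in for the fitted logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]        # probability of class 1
auc = roc_auc_score(y, proba)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y, proba)  # points on the ROC curve
```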

* ROC-AUC on training set

* ROC-AUC on test set

Model Performance Improvement

Optimal threshold using AUC-ROC curve

The optimal threshold is the value that best separates the true positive rate and the false positive rate.
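One common way to pick that value is Youden's J statistic, i.e. the threshold maximising TPR minus FPR; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
# Youden's J statistic: threshold with the largest gap between TPR and FPR
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
preds = (proba >= optimal_threshold).astype(int)
```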

Checking model performance on training set

Checking model performance on test set

Let's use Precision-Recall curve and see if we can find a better threshold

The Precision-Recall curve shows the tradeoff between precision and recall for different thresholds. It can be used to select an optimal threshold to improve the model's performance.

precision_recall_curve()

Returns the precision, recall, and threshold values

Input:

    1. True labels
    2. Predicted probabilities for class 1

Output: Precision, recall, and threshold values
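A sketch of threshold selection from this curve, here maximising F1 as one reasonable criterion (synthetic data stands in for the real model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
# F1 per candidate threshold; the final (precision, recall) point has no
# associated threshold, so it is dropped before taking the argmax
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
```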

Checking model performance on training set

Checking model performance on test set

Conclusion: Logistic Regression

Build Decision Tree Model
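A minimal fitting sketch with `DecisionTreeClassifier` on synthetic data; an unconstrained tree typically fits the training set perfectly, which is why the next step checks both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear target a tree can learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # unconstrained trees often hit 1.0
test_acc = tree.score(X_test, y_test)     # a much lower score signals overfitting
```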

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting

Using GridSearchCV for hyperparameter tuning of our tree model
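A sketch of the tuning step with scikit-learn's `GridSearchCV`; the parameter grid is a plausible example, not the project's exact grid, and recall is scored on the assumption that missing a likely loan buyer is the costly error:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Example grid; the real project may search different values
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
best_tree = grid.best_estimator_
```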

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Observations from the tree:

Using the decision rules extracted above, we can make interpretations from the decision tree model, such as:

Interpretations from other decision rules can be made similarly

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
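The procedure described above can be sketched as follows on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

# Effective alphas and total leaf impurities along the pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Fit one tree per effective alpha; larger alphas prune more aggressively
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]

# Drop the last element: it is the trivial single-node (root-only) tree
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
```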

The maximum value of recall is at an alpha of 0.009, but if we choose it, the decision tree will only have a root node and we would lose the business rules. Instead, we can choose an alpha of 0.001, retaining information while still getting a high recall.

Checking performance on training set

Checking performance on test set

Creating a model with ccp_alpha of 0.01

Checking performance on the training set

Checking performance on the test set

Conclusions: Decision Tree Model

Recommendations